library(qrcode)
plot(qr_code('https://pfagandini.github.io/novasbe_econometrics/R_SLR_Lecture_Slides.html'))

From Theory to Practice in R
In the previous lectures we covered the algebra of Simple Linear Regression on the whiteboard.
Today we bring it to life with data, code, and simulations.
Go to this website and log in however you like; I log in with my Google account.
https://www.kaggle.com
Click on “Create”
And select “Notebook”
Sometimes it defaults to Python, but it learns your preference (use R often and it will start creating R notebooks by default)
To check/set an R environment:
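A quick way to confirm the notebook is actually running an R kernel (and not the Python default) is to evaluate a line of R in a cell:

```r
# Sanity check: this only works in an R kernel
R.version.string   # e.g. "R version 4.x.x (...)"
```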
mtcars dataset

Built into R: data on 32 automobiles (1974 Motor Trend magazine).
Our question: How does the weight of a car affect its fuel consumption?
\[\text{mpg}_i = \beta_0 + \beta_1 \cdot \text{wt}_i + u_i\]
mpg wt hp cyl
Mazda RX4 21.0 2.620 110 6
Mazda RX4 Wag 21.0 2.875 110 6
Datsun 710 22.8 2.320 93 4
Hornet 4 Drive 21.4 3.215 110 6
Hornet Sportabout 18.7 3.440 175 8
Valiant 18.1 3.460 105 6
Duster 360 14.3 3.570 245 8
Merc 240D 24.4 3.190 62 4
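The table above can be reproduced by selecting the relevant columns and taking the first eight rows:

```r
# First 8 rows of the columns used in this lecture
head(mtcars[, c("mpg", "wt", "hp", "cyl")], 8)
```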
x <- mtcars$wt   # weight (1000 lbs)
y <- mtcars$mpg  # miles per gallon
par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))
hist(y, breaks = 10, col = "steelblue", border = "white",
main = "Distribution of mpg", xlab = "Miles per Gallon")
hist(x, breaks = 10, col = "coral", border = "white",
main = "Distribution of Weight", xlab = "Weight (1000 lbs)")
plot(x, y, pch = 19, col = "steelblue",
main = "mpg vs Weight", xlab = "Weight (1000 lbs)", ylab = "mpg")
grid()

Strong negative linear relationship: heavier cars tend to have lower fuel efficiency.
Question: Looking at the scatter plot, does a straight line seem reasonable? What sign should the slope have?
\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}\]
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
Let’s compute this step by step.
car x y dx dy dx_times_dy
1 Mazda RX4 2.620 21.0 -0.597 0.909 -0.543
2 Mazda RX4 Wag 2.875 21.0 -0.342 0.909 -0.311
3 Datsun 710 2.320 22.8 -0.897 2.709 -2.431
4 Hornet 4 Drive 3.215 21.4 -0.002 1.309 -0.003
5 Hornet Sportabout 3.440 18.7 0.223 -1.391 -0.310
6 Valiant 3.460 18.1 0.243 -1.991 -0.483
β̂₀ = 37.2851
β̂₁ = -5.3445
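The step-by-step computation above can be reproduced with a few lines of base R; a minimal sketch:

```r
x <- mtcars$wt
y <- mtcars$mpg
x_bar <- mean(x); y_bar <- mean(y)
Sxy <- sum((x - x_bar) * (y - y_bar))   # numerator: S_xy
Sxx <- sum((x - x_bar)^2)               # denominator: S_xx
beta1_hat <- Sxy / Sxx                  # slope estimate
beta0_hat <- y_bar - beta1_hat * x_bar  # intercept estimate
round(c(beta0_hat, beta1_hat), 4)       # 37.2851, -5.3445
```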
For each additional 1000 lbs of weight, mpg decreases by 5.34 miles per gallon on average, ceteris paribus.
plot(x, y, pch = 19, col = "steelblue", cex = 1.3,
main = "OLS Regression Line (computed by hand)",
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
abline(a = beta0_hat, b = beta1_hat, col = "red", lwd = 2.5)
points(x_bar, y_bar, pch = 4, col = "red", cex = 2.5, lwd = 3)
text(x_bar + 0.15, y_bar + 1.2,
expression(paste("(", bar(x), ", ", bar(y), ")")), col = "red", cex = 1.1)
grid()
legend("topright",
legend = c(paste0("ŷ = ", round(beta0_hat, 2), " + (",
round(beta1_hat, 2), ") × wt"), "Point of means"),
col = c("red", "red"), lty = c(1, NA), pch = c(NA, 4),
lwd = c(2.5, 3), bty = "n")

The OLS regression line always passes through the point of means \((\bar{x}, \bar{y})\).
Why? From the formula:
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
Substitute \(x = \bar{x}\):
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x} = (\bar{y} - \hat{\beta}_1 \bar{x}) + \hat{\beta}_1 \bar{x} = \bar{y}\]
y_hat <- beta0_hat + beta1_hat * x
resid <- y - y_hat
plot(x, y, pch = 19, col = "steelblue", cex = 1.3,
main = "Residuals: the vertical distances",
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
abline(a = beta0_hat, b = beta1_hat, col = "red", lwd = 2)
segments(x0 = x, y0 = y, x1 = x, y1 = y_hat, col = "gray40", lty = 2)
points(x, y_hat, pch = 1, col = "red", cex = 1)
grid()
legend("topright",
legend = c("Observed (y)", "Fitted (ŷ)", "Residual (y − ŷ)"),
col = c("steelblue", "red", "gray40"),
pch = c(19, 1, NA), lty = c(NA, NA, 2), bty = "n")

Sum of residuals: 0
Cor(x, residuals): 0
Mean of fitted values: 20.0906
Mean of y: 20.0906
→ They are equal!
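These algebraic properties of OLS residuals can be verified directly; a sketch using `lm()`:

```r
model <- lm(mpg ~ wt, data = mtcars)
sum(residuals(model))                    # numerically zero
cor(mtcars$wt, residuals(model))         # numerically zero
mean(fitted(model)) - mean(mtcars$mpg)   # numerically zero
```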
lm() — R Does It for Us

\[\hat{\sigma}^2 = \frac{\text{SSR}}{n - 2} = \frac{\sum_{i=1}^{n} \hat{u}_i^2}{n - 2}\]
SSR = 278.3219
df = n - 2 = 30
σ̂² = 9.2774
σ̂ = 3.0459 (Residual Standard Error)
\[\text{se}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{S_{xx}}}\]
Under \(H_0: \beta_j = 0\): \(\quad t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)} \sim t_{n-2}\)
Estimate Std.Error t-value p-value
β̂₁ -5.3445 0.5591 -9.559 1.29e-10
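A self-contained sketch that reproduces the standard error, t-statistic, and p-value shown above:

```r
x <- mtcars$wt; y <- mtcars$mpg; n <- length(y)
Sxx <- sum((x - mean(x))^2)
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / Sxx
beta0_hat <- mean(y) - beta1_hat * mean(x)
resid <- y - (beta0_hat + beta1_hat * x)
sigma_hat <- sqrt(sum(resid^2) / (n - 2))   # residual standard error
se_beta1 <- sigma_hat / sqrt(Sxx)           # 0.5591
t_beta1 <- beta1_hat / se_beta1             # -9.559
p_beta1 <- 2 * pt(-abs(t_beta1), df = n - 2)
```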
alpha <- 0.05
t_crit <- qt(1 - alpha/2, df = df)
curve(dt(x, df = df), from = -15, to = 15, n = 500, lwd = 2, col = "steelblue",
main = expression(paste("t-test for ", H[0], ": ", beta[1], " = 0")),
xlab = "t", ylab = "Density")
x_left <- seq(-15, -t_crit, length.out = 200)
x_right <- seq(t_crit, 15, length.out = 200)
polygon(c(x_left, rev(x_left)), c(dt(x_left, df), rep(0, 200)),
col = rgb(1, 0, 0, 0.3), border = NA)
polygon(c(x_right, rev(x_right)), c(dt(x_right, df), rep(0, 200)),
col = rgb(1, 0, 0, 0.3), border = NA)
abline(v = t_beta1, col = "red", lwd = 2.5, lty = 2)
text(t_beta1 + 1.5, 0.15, paste0("t = ", round(t_beta1, 2)),
col = "red", cex = 1.2, font = 2)
abline(v = c(-t_crit, t_crit), col = "gray40", lty = 3)
legend("topleft",
legend = c("t(30) distribution", "Rejection region (α = 0.05)", "Our t-statistic"),
col = c("steelblue", rgb(1,0,0,0.3), "red"),
lwd = c(2, 8, 2.5), lty = c(1, 1, 2), bty = "n")

\[\hat{\beta}_j \pm t_{\alpha/2, \, n-2} \cdot \text{se}(\hat{\beta}_j)\]
Critical value t_{0.025, 30} = 2.0423
95% CI for β₁: [ -6.4863 , -4.2026 ]
Note: 0 is NOT inside the CI for β₁ → we reject H₀: β₁ = 0
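The same interval comes straight from `confint()`:

```r
model <- lm(mpg ~ wt, data = mtcars)
confint(model, "wt", level = 0.95)   # [-6.4863, -4.2026]
```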
summary()
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Everything matches our hand calculations.
SST <- sum((y - mean(y))^2)       # total variation in y
SSE <- sum((y_hat - mean(y))^2)   # explained by the regression
SSR <- sum((y - y_hat)^2)         # residual (unexplained)
par(mar = c(4, 4, 3, 1))
barplot(c(SST = SST, SSE = SSE, SSR = SSR),
col = c("gray70", "steelblue", "coral"),
main = "Decomposition: SST = SSE + SSR", ylab = "Sum of Squares", border = NA)
text(0.7, SST/2, paste0("SST\n", round(SST, 1)), cex = 1, font = 2)
text(1.9, SSE/2, paste0("SSE\n", round(SSE, 1), "\n(", round(SSE/SST*100,1), "%)"),
cex = 1, font = 2)
text(3.1, SSR/2, paste0("SSR\n", round(SSR, 1), "\n(", round(SSR/SST*100,1), "%)"),
cex = 1, font = 2)

We said \(\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\) depends on the sample.
If we took a different sample from the same population, we would get a different \(\hat{\beta}_1\).
Let’s simulate this!
beta0_true <- 5; beta1_true <- 3; sigma_true <- 4
set.seed(42)
n_sim <- 1000; n_obs <- 50
x_sim <- runif(n_obs, min = 0, max = 10)
beta1_estimates <- numeric(n_sim)
beta0_estimates <- numeric(n_sim)
for (i in 1:n_sim) {
u <- rnorm(n_obs, mean = 0, sd = sigma_true)
y_sim <- beta0_true + beta1_true * x_sim + u
fit <- lm(y_sim ~ x_sim)
beta0_estimates[i] <- coef(fit)[1]
beta1_estimates[i] <- coef(fit)[2]
}
cat("Done:", n_sim, "regressions, each with n =", n_obs, "observations.\n")

Done: 1000 regressions, each with n = 50 observations.
plot(NULL, xlim = c(0, 10), ylim = c(-5, 45),
main = "20 Different Samples → 20 Different Regression Lines",
xlab = "x", ylab = "y")
for (i in 1:20) {
abline(a = beta0_estimates[i], b = beta1_estimates[i],
col = rgb(0.3, 0.3, 0.8, 0.3), lwd = 1.5)
}
abline(a = beta0_true, b = beta1_true, col = "red", lwd = 3)
legend("topleft",
legend = c("True line: y = 5 + 3x", "Estimated lines (one per sample)"),
col = c("red", rgb(0.3, 0.3, 0.8, 0.5)), lwd = c(3, 2), bty = "n")
grid()

se_theory <- sigma_true / sqrt(sum((x_sim - mean(x_sim))^2))
hist(beta1_estimates, breaks = 40, freq = FALSE,
col = "steelblue", border = "white",
main = expression(paste("Sampling Distribution of ", hat(beta)[1],
" (1000 simulations)")),
xlab = expression(hat(beta)[1]),
xlim = c(beta1_true - 1.5, beta1_true + 1.5))
abline(v = beta1_true, col = "red", lwd = 3)
abline(v = mean(beta1_estimates), col = "orange", lwd = 2, lty = 2)
curve(dnorm(x, mean = beta1_true, sd = se_theory), add = TRUE,
col = "red", lwd = 2, lty = 2)
legend("topright",
legend = c(paste0("True β₁ = ", beta1_true),
paste0("Mean(β̂₁) = ", round(mean(beta1_estimates), 4)),
"Theoretical N(β₁, σ²/Sxx)"),
col = c("red", "orange", "red"),
lwd = c(3, 2, 2), lty = c(1, 2, 2), bty = "n")

True β₁ = 3
Mean of β̂₁ = 2.9945
→ UNBIASED: E[β̂₁] ≈ β₁ ✓
Std Dev of β̂₁ (simulated) = 0.1932
Std Dev of β̂₁ (theory) = 0.1882
→ MATCHES: se(β̂₁) = σ/√Sxx ✓
True β₀ = 5
Mean of β̂₀ = 5.022
→ UNBIASED: E[β̂₀] ≈ β₀ ✓
\[\text{se}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{S_{xx}}}\]
Three things affect precision: the sample size \(n\), the error variance \(\sigma^2\), and the variation in the regressor \(S_{xx}\). More observations and more spread in \(x\) both increase \(S_{xx}\) and shrink the standard error; more noise inflates it.
set.seed(123)
sample_sizes <- c(20, 50, 100, 500)
n_reps <- 1000
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
for (nn in sample_sizes) {
x_temp <- runif(nn, 0, 10)
b1_vec <- numeric(n_reps)
for (j in 1:n_reps) {
u_temp <- rnorm(nn, 0, sigma_true)
y_temp <- beta0_true + beta1_true * x_temp + u_temp
b1_vec[j] <- coef(lm(y_temp ~ x_temp))[2]
}
hist(b1_vec, breaks = 35, freq = FALSE, col = "steelblue", border = "white",
main = paste0("n = ", nn, " (sd = ", round(sd(b1_vec), 3), ")"),
xlab = expression(hat(beta)[1]), xlim = c(1.5, 4.5))
abline(v = beta1_true, col = "red", lwd = 2)
}

set.seed(456)
sigmas <- c(1, 4, 8, 16)
nn_fixed <- 50; x_temp <- runif(nn_fixed, 0, 10)
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
for (ss in sigmas) {
b1_vec <- numeric(n_reps)
for (j in 1:n_reps) {
u_temp <- rnorm(nn_fixed, 0, ss)
y_temp <- beta0_true + beta1_true * x_temp + u_temp
b1_vec[j] <- coef(lm(y_temp ~ x_temp))[2]
}
hist(b1_vec, breaks = 35, freq = FALSE, col = "coral", border = "white",
main = bquote(sigma == .(ss) ~ " (sd = " * .(round(sd(b1_vec), 3)) * ")"),
xlab = expression(hat(beta)[1]), xlim = c(-1, 7))
abline(v = beta1_true, col = "red", lwd = 2)
}

More data → tighter distribution.
More noise → wider distribution.
In both cases, the estimator remains centered on the true value (unbiased).
set.seed(2024)
model <- lm(mpg ~ wt, data = mtcars)  # full-sample fit (dashed line below)
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "gray70", cex = 1.2,
main = "8 Random Subsamples (n = 20) from mtcars",
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
"#FF7F00", "#A65628", "#F781BF", "#999999")
sub_betas <- numeric(8)
for (i in 1:8) {
idx <- sample(1:32, size = 20, replace = FALSE)
fit_sub <- lm(mpg ~ wt, data = mtcars[idx, ])
abline(fit_sub, col = colors[i], lwd = 2)
sub_betas[i] <- coef(fit_sub)[2]
}
abline(model, col = "black", lwd = 3, lty = 2)
legend("topright",
legend = c("Full sample (n=32)", "Subsample lines (n=20)"),
col = c("black", "steelblue"), lwd = c(3, 2), lty = c(2, 1), bty = "n")
grid()

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(fitted(model), residuals(model), pch = 19, col = "steelblue",
main = "Residuals vs Fitted", xlab = "Fitted Values (ŷ)", ylab = "Residuals (û)")
abline(h = 0, col = "red", lwd = 2, lty = 2); grid()
hist(residuals(model), breaks = 10, freq = FALSE,
col = "steelblue", border = "white",
main = "Distribution of Residuals", xlab = "Residuals")
curve(dnorm(x, mean = 0, sd = sigma_hat), add = TRUE, col = "red", lwd = 2)
qqnorm(residuals(model), pch = 19, col = "steelblue", main = "Normal Q-Q Plot")
qqline(residuals(model), col = "red", lwd = 2)
plot(fitted(model), sqrt(abs(rstandard(model))), pch = 19, col = "steelblue",
main = "Scale-Location", xlab = "Fitted Values",
ylab = expression(sqrt("Standardized Residuals")))
grid()

x_grid <- data.frame(wt = seq(min(mtcars$wt) - 0.2, max(mtcars$wt) + 0.2,
length.out = 200))
pred_ci <- predict(model, newdata = x_grid, interval = "confidence")
pred_pi <- predict(model, newdata = x_grid, interval = "prediction")
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue", cex = 1.2,
main = "Confidence and Prediction Intervals",
xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
ylim = c(min(pred_pi[,"lwr"]) - 1, max(pred_pi[,"upr"]) + 1))
polygon(c(x_grid$wt, rev(x_grid$wt)),
c(pred_pi[,"lwr"], rev(pred_pi[,"upr"])),
col = rgb(1, 0.6, 0.4, 0.2), border = NA)
polygon(c(x_grid$wt, rev(x_grid$wt)),
c(pred_ci[,"lwr"], rev(pred_ci[,"upr"])),
col = rgb(0.3, 0.5, 0.8, 0.3), border = NA)
lines(x_grid$wt, pred_ci[,"fit"], col = "red", lwd = 2.5)
points(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue", cex = 1.2)
legend("topright",
legend = c("OLS line", "95% CI for E[Y|X]", "95% PI for new Y"),
col = c("red", rgb(0.3,0.5,0.8,0.5), rgb(1,0.6,0.4,0.4)),
lwd = c(2.5, 8, 8), bty = "n")
grid()

Confidence interval (blue): where the true regression line likely is. Narrower near \(\bar{x}\).
Prediction interval (orange): Where a new observation would fall. Always wider because it includes both line uncertainty and the noise \(\sigma^2\).
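For a single hypothetical car — say, wt = 3 (3000 lbs, a value chosen here purely for illustration) — the two intervals compare as follows:

```r
model <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)
predict(model, new_car, interval = "confidence")   # narrow: mean response E[Y|X=3]
predict(model, new_car, interval = "prediction")   # wide: a single new observation
```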
| Task | Code |
|---|---|
| Fit model | `lm(y ~ x, data = df)` |
| Coefficients | `coef(model)` |
| Full output | `summary(model)` |
| CIs | `confint(model)` |
| Fitted values | `fitted(model)` |
| Residuals | `residuals(model)` |
| Predict | `predict(model, newdata)` |
| \(R^2\) | `summary(model)$r.squared` |